Abstract: An important parameter in the study of population evolution is
$\theta = 4Nv$, where $N$ is the effective population size and $v$ is the rate of
mutation per locus per generation. Therefore, $\theta$ represents the mean number of
mutations per site per generation. There are many estimators of $\theta$, one of them
being the mean number of pairwise nucleotide differences, which we call $T_2$. Other
estimators are $T_1$, based on the number of segregating sites, and $T_3$, based on
the number of singletons. The concept of selective neutrality can be interpreted
as a differentiated nucleotide distribution for mutant sites when compared to
the overall nucleotide distribution. Tajima (1989) proposed a test of selective
neutrality, known as Tajima's test, based on $T_2 - T_1$. Its complex empirical
behavior (Kiihl, 2005) motivates us to propose a test statistic solely based on
$T_2$. We are thus able to prove asymptotic normality under different assumptions
on the number of sequences and number of sites via $U$-statistics theory.



1. Introduction 

A large number of metrics have been constructed to measure genetic distances.
Examples of such metrics are the Gini-Simpson index of diversity (Gini [8],
Simpson [27] and Sen [23]) and the Nei, Mahalanobis and Hamming distances
(Rao [20, 21] and Chakraborty and Rao [2]). Much work has been done using these
measures of genetic distance to test homogeneity of genetic data or genetic
polymorphism (Pinheiro et al. [19], Sen [24] and Pinheiro et al. [15]).

The Gini-Simpson index can be used to build a sub-additive analysis of variance
for categorical data (Pinheiro et al. [17]). Moreover, tests of homogeneity among
groups of genomic sequences can be constructed (Pinheiro et al. [15, 18] and
Pinheiro et al. [19]) based on Hamming distance inequalities (Sen [23]).

In the study of population evolution, one is interested in the population's degree
of polymorphism. This can be expressed as the rate of mutation per locus per
generation. Some tests for selective neutrality include Tajima [29], Fu [5], Fu and
Li [7] and Fu [6]. A common feature of these tests is their intrinsic dependence on
the Jukes and Cantor type of mutation processes (Jukes and Cantor [12]). However,
there is abundant evidence of the inadequacy of assumptions such as equal rates of
mutation for every site (Fitch and Margoliash [4], Yang [34] and Uzzell and Corbin
[32]).



Supported in part by CNPq Grants 474329/2004-6, 476781/2004-3 and Fapesp Grant
2003/10105-2.

Departamento de Estatística, UNICAMP, Caixa Postal 6065, CEP 13083-970, Campinas SP,
Brazil, e-mail: pinheiro@ime.unicamp.br; hildete@ime.unicamp.br

AMS 2000 subject classifications: Primary 62G10, 62G20; secondary 62P10.
Keywords and phrases: asymptotic normality, U-statistics, population evolution.






A. Pinheiro, H. P. Pinheiro and S. Kiihl 



Another characteristic of genomic data is the challenge posed to parametric
models by the enormously large number of sites, $K$, relative to the much smaller
number of sampled sequences, $n$. These statistical problems and some solutions for
related measures are discussed by Sen et al. [26], Sen [25] and Pinheiro et al. [15].

In the study of population evolution, an important parameter of interest is
$\theta = 4Nv$, where $N$ is the effective population size and $v$ is the rate of
mutation per locus per generation. Thus, $\theta$ represents the mean number of
mutations per site per generation.

There are many estimators of $\theta$ in the literature (Hartl and Clark [9]). One of
them is the mean number of pairwise nucleotide differences, which we will call $T_2$.
Other estimators of $\theta$ are $T_1$, based on the number of segregating sites, and
$T_3$, based on the number of singletons. The estimators $T_1$, $T_2$ and $T_3$ are
used in the literature to build test statistics for selective neutrality (Tajima [29]
and Fu and Li [7]).

The concept of selective neutrality can be interpreted as a differentiated
nucleotide distribution for mutant sites when compared to the overall nucleotide
distribution. A statistic based on $T_2 - T_1$ has been proposed for testing such a
hypothesis (Tajima [29]). This, however, has shortcomings, due to the complex
behavior of $T_2 - T_1$ (Kiihl [13]). We propose a test statistic solely based on
$T_2$. For this statistic, we are able to prove asymptotic normality under different
asymptotic conditions on the number of sequences and sites, and also to derive
Berry-Esseen rates of convergence.

The paper is organized as follows. Section 2 reviews the biological motivation and
the tests for selective neutrality available in the literature. In Section 3 we
propose a test solely based on the nucleotide frequencies. Its asymptotic behavior is
studied in Section 4. In Section 5, we illustrate the performance of the proposed
test and of Tajima's procedure on two genetic data sets.

2. Tajima's Test of Selective Neutrality 

In order to better understand the differences between these estimators, let us
suppose that we have a sample of 5 DNA sequences, from which 500 sites were
sequenced. Table 1 presents only the 16 polymorphic or segregating sites, i.e., those
sites in which we find nucleotide differences. The other 484 sites, which do not
present differences, are called non-segregating sites. Among the polymorphic sites,
sites 3, 5, 6, 7, 8, 11, 12, 13 and 15 are singletons, since they present only one
nucleotide different from the others.

Table 1. Polymorphic sites in a sample of five genes.
*: number of pairwise differences for each site. There are 16 segregating sites.
Singletons are presented in boldface.

Let $S$ be the number of segregating sites, $T_2$ be the mean number of pairwise
differences and $S^*$ be the number of singletons. In the example given in Table 1,
$S = 16$,

$T_2 = \dfrac{6 + 6 + 4 + \cdots + 6 + 4 + 6}{16} = 4.94$

and $S^* = 8$. The main difference between $S^*$ and $T_2$ is the effect of selection.
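
As a rough illustration of how these summaries are obtained from aligned sequences, the following Python sketch counts segregating sites, singletons and the number of pairwise differences at each site. The function name `site_summary` and the toy alignment are hypothetical; they are not the data of Table 1.

```python
from itertools import combinations


def site_summary(seqs):
    """Return (S, singletons, per_site_diffs) for aligned, equal-length sequences.

    S              : number of segregating (polymorphic) sites
    singletons     : sites where exactly one sequence differs from all the others
    per_site_diffs : number of differing sequence pairs at each site
    """
    n = len(seqs)
    S = 0
    singletons = 0
    per_site_diffs = []
    for column in zip(*seqs):          # iterate over sites
        counts = {}
        for base in column:
            counts[base] = counts.get(base, 0) + 1
        if len(counts) > 1:            # more than one nucleotide: segregating
            S += 1
            if sorted(counts.values()) == [1, n - 1]:
                singletons += 1
        per_site_diffs.append(sum(a != b for a, b in combinations(column, 2)))
    return S, singletons, per_site_diffs


# Hypothetical toy alignment: 5 sequences, 6 sites.
toy = ["ACGTAC", "ACGTAC", "ATGTAC", "ATGAAC", "ACGTTC"]
S, n_singletons, diffs = site_summary(toy)
print(S, n_singletons, diffs)
```

With five sequences, a singleton site contributes 4 differing pairs and a site split 3 to 2 contributes 6, which is the pattern of the terms appearing in the numerator of the worked example.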

The motivation and interpretation of the test of neutral mutation is that
nucleotide polymorphism (segregating sites) and nucleotide diversity (pairwise
nucleotide differences) differ primarily because the segregating sites are
indifferent to the relative frequencies of the polymorphic nucleotides.

Mutant nucleotides are maintained in the population with low frequency. As the
number of segregating sites ignores the frequency of mutant nucleotides, its value
can be strongly affected by their existence, even if they occur with low frequency.
On the other hand, the existence of mutant nucleotides with low frequency does
not affect the mean number of pairwise differences, since in this case the frequency
of mutations is considered. In other words, if some of the observed mutations have
selective effects, the estimator $T_1$ of $\theta$, based on $S$, cannot be the same
as $T_2$. Major discrepancies occur when:

• The relative frequencies of polymorphic variants are almost identical (nearly
equal). This pattern increases the proportion of nucleotide pairwise
differences, hence $T_2 - T_1$ is positive. This suggests either some type of
balancing selection, in which heterozygous genotypes are favored, or some type of
diversifying selection, in which genotypes carrying the less common alleles are
favored;

• The relative frequencies of the polymorphic variants are too unequal, with an
excess of the most common type and a deficiency of the less common types.
This pattern results in a decrease in the proportion of pairwise differences,
so $T_2 - T_1$ is negative. Typical reasons for excessively unequal frequencies
can be selection against genotypes carrying the less frequent alleles, a recent
population bottleneck eliminating less frequent alleles, and insufficient time
since the occurrence of the bottleneck to restore the equilibrium between
mutation and random genetic drift.

The most common and well-known test of selective neutrality in the literature
is Tajima's D test (Tajima [29]). This test uses the following statistic, known as
Tajima's $D$ statistic, to test the hypothesis of neutral mutation:

(2.1)  $D = \dfrac{D_1}{\sqrt{\mathrm{Var}(D_1)}}$,

where $D_1 = T_2 - T_1$, for which

(2.2)  $\mathrm{Var}(D_1) = \left[\dfrac{n+1}{3(n-1)} - \dfrac{1}{a_n}\right]\theta + \left[\dfrac{2(n^2+n+3)}{9n(n-1)} - \dfrac{n+2}{a_n n} + \dfrac{b_n}{a_n^2}\right]\theta^2$,

with $a_n = \sum_{j=1}^{n-1} 1/j$ and $b_n = \sum_{j=1}^{n-1} 1/j^2$.
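
For readers who want to compute the statistic, here is a minimal Python sketch of Tajima's $D$ in the form commonly attributed to Tajima [29], in which the unknown variance is replaced by the estimate $e_1 S + e_2 S(S-1)$. The function name `tajima_d`, the argument names and the intermediate constants `b1`, `b2`, `c1`, `c2`, `e1`, `e2` are our own labels, and the inputs are assumed to follow Tajima's per-sequence convention: `pi` is the total number of pairwise differences divided by the number of sequence pairs.

```python
import math


def tajima_d(n, S, pi):
    """Tajima's D for n sequences, S segregating sites and mean pairwise
    difference pi (summed over sites, averaged over sequence pairs).

    Assumption: the variance of pi - S/a1 is estimated by
    e1*S + e2*S*(S-1), the usual plug-in form.
    """
    if n < 4:
        raise ValueError("the variance estimate is degenerate for n < 4")
    if S < 1:
        raise ValueError("at least one segregating site is required")
    a1 = sum(1.0 / j for j in range(1, n))        # a_n in the text
    a2 = sum(1.0 / j ** 2 for j in range(1, n))   # b_n in the text
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d1 = pi - S / a1                              # numerator: T2 - T1
    return d1 / math.sqrt(e1 * S + e2 * S * (S - 1))


# Hypothetical inputs: 5 sequences, 16 segregating sites.
print(tajima_d(5, 16, 7.0))
```

A value of `pi` below `S / a1` gives a negative $D$ and a value above it gives a positive $D$, matching the two discrepancy patterns listed above.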

Note that $D_1$ is not a $U$-statistic, i.e. $E(T_1)$ is not an estimable parameter
in Hoeffding's sense, since $T_1$ estimates it by using the sample as a whole.
Heuristic motivation is provided by Tajima [29] to use Beta distribution tables for
the asymptotic behavior of $D$. In Kiihl [13] the theoretical characteristics of $D$
are carefully studied. It is shown that $D$'s asymptotic distribution is infinitely
divisible, but strong theoretical evidence is provided against asymptotic normality.
In view of the difficulties in dealing with the theoretical distribution of $D$, the
use of resampling methods, such as the bootstrap, is recommended to generate its
empirical distribution and to compute the p-value of the test.

3. The test of selective neutrality mutations based on nucleotide frequencies

Based on the interpretation of the vaguely defined alternative hypothesis discussed
in Section 2, we can think of a hypothesis driven by differences in the nucleotide
frequencies between segregating and non-segregating sites.

Let $\Pi_{cl}$ be the probability of having category (nucleotide) $c$ at site $l$,
with $c = 1, \ldots, 4$; $l = 1, \ldots, K$, where $K$ is the number of segregating
sites. Let $\Pi_c$ be the overall probability of having category (nucleotide) $c$ in
non-segregating sites.

Now, let $X_i = (X_{i1}, \ldots, X_{iK})'$ and $X_j = (X_{j1}, \ldots, X_{jK})'$ be
random vectors representing DNA sequences $i$ and $j$. So, $X_{il}$ can assume values
in the set $\{A, C, T, G\}$, where A represents Adenine, C Cytosine, T Thymine and G
Guanine. As in Pinheiro et al. [15] we write

(3.1)  $D_{ij} = \dfrac{1}{K}\sum_{l=1}^{K} I(X_{il} \neq X_{jl})$,

the proportion of sites where $X_i$ and $X_j$ differ; here $I(A)$ stands for the
indicator function of set $A$. Consider

(3.2)  $\Pi_K = \dfrac{1}{K}\sum_{l=1}^{K} P(X_{il} \neq X_{jl}) = \dfrac{1}{K}\sum_{l=1}^{K}\sum_{c=1}^{4} \Pi_{cl}(1 - \Pi_{cl}) = 1 - \dfrac{1}{K}\sum_{l=1}^{K}\sum_{c=1}^{4} \Pi_{cl}^2$.



A natural, unbiased and optimal nonparametric estimator of $\Pi_K$ is $T_2$, a
$U$-statistic (Hoeffding [10]) of degree 2, given by

(3.3)  $T_2 = \dfrac{1}{K}\dbinom{n}{2}^{-1}\sum_{l=1}^{K}\sum_{1 \le i < j \le n} I(X_{il} \neq X_{jl})$.
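
To make the definition of $T_2$ concrete, the sketch below computes it in two ways: by direct enumeration of all sequence pairs, and from the per-site nucleotide counts, using the fact that the number of differing pairs at a site equals $\binom{n}{2}$ minus the number of identical pairs. The function names and the toy alignment are hypothetical; the normalization by $K$ follows the proportion-of-sites convention of (3.1).

```python
from collections import Counter
from itertools import combinations


def t2_bruteforce(seqs):
    """Average, over all pairs of sequences, of the proportion of differing sites."""
    n, K = len(seqs), len(seqs[0])
    n_pairs = n * (n - 1) // 2
    total = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(seqs, 2)
    )
    return total / (K * n_pairs)


def t2_from_counts(seqs):
    """Same quantity, computed site by site from nucleotide counts."""
    n, K = len(seqs), len(seqs[0])
    n_pairs = n * (n - 1) // 2
    total = 0
    for column in zip(*seqs):
        # pairs of sequences carrying the same nucleotide at this site
        same = sum(m * (m - 1) // 2 for m in Counter(column).values())
        total += n_pairs - same
    return total / (K * n_pairs)


# Hypothetical toy alignment: 5 sequences, 6 sites.
toy = ["ACGTAC", "ACGTAC", "ATGTAC", "ATGAAC", "ACGTTC"]
print(t2_bruteforce(toy), t2_from_counts(toy))
```

The count-based version costs on the order of $nK$ operations instead of $n^2 K$, which matters when many sequences are sampled.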

We can write, under the null hypothesis of neutral mutation,

(3.4)  $H_0: E(T_2) = 1 - \dfrac{1}{|M|}\sum_{l \in M}\sum_{c=1}^{4}\Pi_{cl}^2 = \theta_0$,

where $M$ is the set of non-segregating sites and $|M|$ is its cardinality.
Using the Hoeffding-decomposition (Hoeffding [10]), we have

(3.5)  $T_2 = \Pi_K + 2H_n^{(1)} + H_n^{(2)}$,

where $H_n^{(1)} = n^{-1}\sum_{i=1}^{n}\{E(D_{ij} \mid X_i) - \Pi_K\}$, and
$H_n^{(2)} = T_2 - 2H_n^{(1)} - \Pi_K$ is the degenerate component of order 2.

If $\sigma_1^2$, the variance of $E(D_{ij} \mid X_i)$, is positive, $T_2$ is a
nondegenerate $U$-statistic of degree 2, and by Hoeffding [10],

(3.6)  $n^{1/2}(T_2 - \Pi_K) \xrightarrow{d} N(0, 4\sigma_1^2)$.



So, in our case, if we assume the same conditions as in Tajima [29], that is,
independence among sites and sequences, we get asymptotic normality for the test
statistic $T_2$, under either $H_0$ or $H_1$. Moreover, since the kernel is bounded,
the usual fourth moment conditions easily hold and Berry-Esseen results are
straightforward, albeit tedious and cumbersome. More powerful tools are provided in
Pinheiro et al. [16, 19] and Pinheiro et al. [15], which we can somewhat mimic to
relax some of the initial conditions of Tajima [29].



4. Asymptotics of the neutral selectivity test 

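Motivated by the discussion in Section 3, we define

(4.1)  $T_2 = \dbinom{n}{2}^{-1}\sum_{i<j}\dfrac{1}{K}\sum_{k=1}^{K} I(X_{ik} \neq X_{jk})$.

$T_2$ is a non-degenerate $U$-statistic of degree 2, for which

(4.2)  $\phi(X_1, X_2) = \dfrac{1}{K}\sum_{k=1}^{K} I(X_{1k} \neq X_{2k})$,

(4.3)  $\psi_1(X_1) = E\{\phi(X_1, X_2) \mid X_1\} = 1 - \dfrac{1}{K}\sum_{k=1}^{K}\Pi_{X_{1k}k}$,

(4.4)  $h^{(1)}(X_1) = \psi_1(X_1) - \Pi_K = \dfrac{1}{K}\sum_{k=1}^{K}\sum_{c=1}^{C}\Pi_{ck}^2 - \dfrac{1}{K}\sum_{k=1}^{K}\Pi_{X_{1k}k}$,

(4.5)  $h^{(2)}(X_1, X_2) = \dfrac{1}{K}\sum_{k=1}^{K} I(X_{1k} \neq X_{2k}) - 1 + \dfrac{1}{K}\sum_{k=1}^{K}\Pi_{X_{1k}k} + \dfrac{1}{K}\sum_{k=1}^{K}\Pi_{X_{2k}k} - \dfrac{1}{K}\sum_{k=1}^{K}\sum_{c=1}^{C}\Pi_{ck}^2$,

where $\Pi_{ck} = P(X_{ik} = c)$, for $k = 1, \ldots, K$ and $c = 1, \ldots, C$.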
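Note that

(4.6)  $\sigma_1^2 = \dfrac{1}{K}\Big\{\sum_{k=1}^{K}\sum_{c=1}^{C}\Pi_{ck}^3 - \sum_{k=1}^{K}\sum_{c,d=1}^{C}\Pi_{ck}^2\Pi_{dk}^2 + \sum_{k \neq l = 1}^{K}\sum_{c,d=1}^{C}\Pi_{ck}\Pi_{dl}\big[P(X_{1k} = c, X_{1l} = d) - \Pi_{ck}\Pi_{dl}\big]\Big\}$.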



Suppose, for instance, that the $K$ sites are independently distributed. Then (4.6)
reduces to

$\sigma_1^2 = \dfrac{1}{K}\Big\{\sum_{k=1}^{K}\sum_{c=1}^{C}\Pi_{ck}^3 - \sum_{k=1}^{K}\sum_{c,d=1}^{C}\Pi_{ck}^2\Pi_{dk}^2\Big\}$,

which will be zero if and only if $\Pi_{ck} = 1/C$, $c = 1, \ldots, C$ and
$k = 1, \ldots, K$, or if for each $k = 1, \ldots, K$ there exists
$c \in \{1, \ldots, C\}$ such that $\Pi_{ck} = 1$. We can then assure that $T_2$
is a non-degenerate U-statistic of degree 2 unless all sites' distributions are
either degenerate or uniform, cases which can generally be classified as
non-interesting for genetic (or other) data. If, however, one is interested in such
null hypotheses, we refer the reader to Pinheiro et al. [15] and Pinheiro et al. [17]
for related issues in a somewhat different approach. For a more general setup in
which both null and alternative hypotheses have associated generalized degenerate
(quasi) U-statistics, we refer the reader to Pinheiro et al. [16].
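
The degeneracy condition can be checked numerically. The sketch below assumes independent sites and takes $\sigma_1^2$ to be the average over sites of $\sum_c \Pi_{ck}^3 - (\sum_c \Pi_{ck}^2)^2$, which is our reading of the reduced form of (4.6); the function name `sigma1_sq_independent` and the example distributions are hypothetical.

```python
def sigma1_sq_independent(site_probs):
    """sigma_1^2 under independent sites.

    site_probs: list of K probability vectors, one per site, each of length C.
    Each site contributes sum_c p_c^3 - (sum_c p_c^2)^2, which is the variance
    of the probability attached to the nucleotide observed at that site.
    """
    K = len(site_probs)
    total = 0.0
    for p in site_probs:
        s2 = sum(q ** 2 for q in p)
        s3 = sum(q ** 3 for q in p)
        total += s3 - s2 ** 2
    return total / K


uniform = [[0.25, 0.25, 0.25, 0.25]] * 3      # every site uniform
degenerate = [[1.0, 0.0, 0.0, 0.0]] * 3       # every site fixed
generic = [[0.7, 0.1, 0.1, 0.1]] * 3          # a typical polymorphic site

for name, probs in [("uniform", uniform), ("degenerate", degenerate), ("generic", generic)]:
    print(name, sigma1_sq_independent(probs))
```

The first two configurations give zero, so the first-order term of the decomposition vanishes there, while the generic configuration gives a strictly positive value.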
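We also know that

(4.7)  $\Pi_K = \dfrac{1}{K}\sum_{k=1}^{K}\sum_{c=1}^{C}\Pi_{ck}(1 - \Pi_{ck})$

and the H-decomposition of $T_2$ is given by

$T_2 = \dfrac{1}{K}\sum_{k=1}^{K}\sum_{c=1}^{C}\Pi_{ck}(1 - \Pi_{ck}) + \dfrac{2}{n}\sum_{i=1}^{n} h^{(1)}(X_i) + \dbinom{n}{2}^{-1}\sum_{i<j} h^{(2)}(X_i, X_j)$.

We can then write

$\sqrt{nK}\,(T_2 - \Pi_K) = 2\sqrt{K/n}\,\sum_{i=1}^{n} h^{(1)}(X_i) + \sqrt{nK}\,\dbinom{n}{2}^{-1}\sum_{i<j} h^{(2)}(X_i, X_j)$,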



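i.e., the asymptotic behavior of $\sqrt{Kn}\,(T_2 - \Pi_K)$ depends only on the
asymptotic behavior of the sum

$Z_n = \dfrac{1}{\sqrt{nK}}\sum_{i=1}^{n}\sum_{k=1}^{K}\Big(\Pi_{X_{ik}k} - \sum_{c=1}^{C}\Pi_{ck}^2\Big)$.

If we write $Z_n = \sum_{i=1}^{n} Y_{ni}$, where

$Y_{ni} = \dfrac{1}{\sqrt{nK}}\sum_{k=1}^{K}\Big(\Pi_{X_{ik}k} - \sum_{c=1}^{C}\Pi_{ck}^2\Big)$,

we need to show the CLT for the array $\{Y_{ni}, n \ge 1, 1 \le i \le n\}$, for
$K = K(n)$.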

If $K$ is finite, $\Pi_K = \Pi$ and CLTs for r.v.'s will suffice. Since
$\mathrm{Var}(T_2) > 0$, Hoeffding [10]'s result can be applied, yielding the
asymptotic normality stated in (3.6).

If, however, $K$ varies with $n$, note that, under a suitable mixing condition
along the sites,
$\mathrm{Var}\Big\{\sum_{k=1}^{K}\Big[\Pi_{X_{ik}k} - \sum_{c=1}^{C}\Pi_{ck}^2\Big]\Big\} = O(K)$
as $K \to \infty$, and therefore $\sigma_1^2 = O(1)$ as $n \to \infty$ (and either
$K$ limited or $K \to \infty$). Using the same mixing rate, $E|Y_{ni}|^4 = O(n^{-2})$
as $n \to \infty$ (and either $K$ limited or $K \to \infty$). Therefore,

(4.9)  $\sum_{i=1}^{n} E|Y_{ni}|^4 \Big/ \Big(\sum_{i=1}^{n} E Y_{ni}^2\Big)^2 \to 0$.



By (4.9), the conditions of Liapounov's CLT are satisfied and, therefore,

(4.10)  $\dfrac{\sqrt{nK}\,(T_2 - \Pi_K)}{2\sigma_1} \xrightarrow{d} N(0, 1)$

as $n \to \infty$ (and either $K$ limited or $K \to \infty$), following Utev [31].

We should note that the convergence given by (4.10) also holds if $K \to \infty$ but
$n$ is limited, simply using the CLT for mixing r.v.'s (Withers [33]). Therefore, we
have asymptotic normality for $n \to \infty$ and/or $K \to \infty$. The only
(sufficient) condition for that is mixing along the sequences if $K$ is large. In any
case, such a hypothesis is much less restrictive than the sequencewise independence,
equal rates of mutation for all sites, infinite number of sites, and Poisson
distribution for the number of mutant sites assumed by Tajima [29], which are
nevertheless not sufficient for the asymptotic normality of $D$.

We can also assess the rate of convergence of the $U$-statistics' CLT by the
appropriate generalization of Berry-Esseen's original results. For instance, if
$n \to \infty$ and $K$ is finite, Korolyuk and Borovskikh [14] prove that

(4.11)  $\sup_x \left| P\left\{\dfrac{\sqrt{n}\,(T_2 - \Pi_K)}{2\sigma_1} \le x\right\} - \Phi(x) \right| \le C\left(\sigma_1^{-3} E|h^{(1)}(X_1)|^3 + \sigma_1^{-5/3} E|h^{(2)}(X_1, X_2)|^{5/3}\right) n^{-1/2}$.



In our case, $\sigma_1^2$ is given by (4.6), $E|h^{(1)}(X_1)|^3$ is bounded through
the moment of $h^{(1)}(X_1)$ given by (4.12), and
$E|h^{(2)}(X_1, X_2)|^{5/3} \le \{E h^{(2)}(X_1, X_2)^2\}^{5/6}$, which is given by
(4.13). We can write



(4.12) X 
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and 



Eh^^\Xi,X2f 

I khk ^ck^dkii - ncfc)(i - lidk) 



K C 



k=l c,d=l 
K C K C 



k=l c,d=l 
K C 
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k^l=l c,d=l 



(4.13) 

For independent sites, (4.12) reduces to 



(^fe=ld=l \ c=l 

K C / C 
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k=l \d=l \ c=l / 



(4.14) 

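and (4.13) becomes

(4.15)  $E h^{(2)}(X_1, X_2)^2 = \dfrac{1}{K^2}\Big\{K\Pi_K - \sum_{k=1}^{K}\sum_{c,d=1}^{C}\Pi_{ck}\Pi_{dk}(1 - \Pi_{ck})(1 - \Pi_{dk}) + 2\sum_{k=1}^{K}\sum_{c,d=1}^{C}\Pi_{ck}^2\Pi_{dk}^2 - 2\sum_{k=1}^{K}\sum_{c=1}^{C}\Pi_{ck}^3\Big\}$.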

Likewise, if $K \to \infty$ and $n$ is finite, under mixing conditions, we get

(4.16)  $\sup_x \left| P\left\{\dfrac{T_2 - \Pi_K}{\sqrt{\mathrm{Var}\,T_2}} \le x\right\} - \Phi(x) \right| \le C\,\eta(K)$,

where $\eta(K)$ will be $K^{-1/2}\log K$ or slower than $K^{-1/2}$ depending on the
mixing rates being exponential or polynomial in $K$, respectively, since the random
variables are all bounded (Tihomirov [30]). Under site independence, the rate will be
$K^{-1/2}$ (Feller [3]).

$H_0$, defined by (3.4), can be tested by

(4.17)  $\tau_n = \dfrac{\sqrt{nK}\,(T_2 - \theta_0)}{\sqrt{\mathrm{Var}\,T_2}}$.

Note that under $H_0$, $\tau_n$ is asymptotically $N(0, 1)$ while, under $H_1$,
$\tau_n = Z_n + \sqrt{nK}\,(\Pi_K - \theta_0)/\sqrt{\mathrm{Var}\,T_2}$, where $Z_n$
is asymptotically $N(0, 1)$ and $\tau_n - Z_n = O(\sqrt{nK})$. Therefore, when
$n \to \infty$ and/or $K \to \infty$, the test defined by rejecting $H_0$ when
$\tau_n > q_\alpha$ has asymptotic size $\alpha$ and asymptotic power one for all
alternatives for which
$\sqrt{nK}\,(\Pi_K - \theta_0)/\sqrt{\mathrm{Var}\,T_2} \to \infty$: that will happen
whenever $K$ is finite and $\Pi_K - \theta_0 > 0$, or when $K \to \infty$ and
$0 < \Pi_K - \theta_0 = O(n^{-1/2}K^{-1/2}\zeta(n, K))$, where
$\zeta(n, K) \to \infty$. We can also deal with Pitman alternatives, i.e., those for
which $0 < \Pi_K - \theta_0 = O(n^{-1/2}K^{-1/2})$. Analogous reasoning will work for
the test defined by $\tau_n < -q_\alpha$ or $|\tau_n| > q_{\alpha/2}$.
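
The decision rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `var_t2` is taken to be an estimate of the variance of $T_2$ itself (for instance a jackknife estimate), so that the standardized statistic is simply $(T_2 - \theta_0)/\sqrt{\widehat{\mathrm{Var}}(T_2)}$ with no additional scaling factor; the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function, accurate in the far tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def neutrality_test(t2, theta0, var_t2):
    """Return (z, p_lower, p_upper, p_two_sided) for H0: E(T2) = theta0.

    p_lower is the p-value for the alternative of a smaller value (negative
    selection), p_upper for a larger value (positive selection).
    """
    z = (t2 - theta0) / math.sqrt(var_t2)
    p_lower = normal_cdf(z)
    p_upper = normal_cdf(-z)
    p_two_sided = 2.0 * min(p_lower, p_upper)
    return z, p_lower, p_upper, p_two_sided


# Hypothetical inputs chosen so that the standardized statistic is -2.
print(neutrality_test(t2=0.10, theta0=0.30, var_t2=0.01))
```

Using `math.erfc` rather than `1 - Phi(z)` keeps the extremely small tail probabilities that arise with large standardized statistics from being rounded to zero.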

5. Applications 

We illustrate the test statistic defined by (4.1) on two data sets. The first is
composed of sequences of Hydromedusa maximiliani, a neotropical freshwater turtle
from the Atlantic Forest's rivers in southeast Brazil (Souza et al. [28]). The sample
is composed of $n = 48$ sequences of $K = 262$ bp of the mitochondrial cytochrome b
DNA. The second data set is composed of $n = 12$ HIV sequences from a single
infected patient (Holmes and Brown [11]). Each sequence has $K = 233$ bp.

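For the computation of the variance of $T_2$, we employed its jackknife estimator
(Sen [22] and Arvesen [1]):

(5.1)  $\widehat{\mathrm{Var}}(T_2) = n^{-1}(n-1)\dbinom{n-1}{m}^{-2}\sum_{c=0}^{m}(cn - m^2)\,S_c$,

where $S_c = \sum \phi(X_{i_1}, X_{i_2})\,\phi(X_{i_3}, X_{i_4})$, the sum being taken
over resamples $\{i_1, i_2, i_3, i_4\}$ from $\{1, \ldots, n\}$ such that there are
$c$ coincident indices; $c = 0, \ldots, m$ (here $m = 2$, the degree of $T_2$).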
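
For small samples the jackknife variance can also be obtained by brute force. The sketch below is the standard delete-one jackknife, which recomputes the statistic with each sequence left out in turn; as far as we understand, the closed-form expression above is meant to give the same estimate without the recomputation. The function names are hypothetical.

```python
from collections import Counter


def t2(seqs):
    """Mean proportion of differing sites over all pairs of sequences."""
    n, K = len(seqs), len(seqs[0])
    n_pairs = n * (n - 1) // 2
    total = 0
    for column in zip(*seqs):
        same = sum(m * (m - 1) // 2 for m in Counter(column).values())
        total += n_pairs - same
    return total / (K * n_pairs)


def jackknife_variance(sample, stat):
    """Delete-one jackknife estimate of the variance of stat(sample)."""
    n = len(sample)
    # statistic recomputed with observation i removed
    leave_one_out = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean = sum(leave_one_out) / n
    return (n - 1) / n * sum((u - mean) ** 2 for u in leave_one_out)


# Hypothetical toy alignment: 5 sequences, 6 sites.
toy = ["ACGTAC", "ACGTAC", "ATGTAC", "ATGAAC", "ACGTTC"]
print(t2(toy), jackknife_variance(toy, t2))
```

Because `jackknife_variance` accepts any statistic, it can be sanity-checked on the sample mean, for which the delete-one jackknife reproduces the usual $s^2/n$.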

For the turtle data set, we find an asymptotic p-value of $2.021 \times 10^{-259}$
(test statistic equals $-34.41$). For the HIV data set, we find an asymptotic p-value of
$3.131 \times 10^{-20}$ (test statistic equals $-9.14$). Therefore, there is very
strong statistical evidence of negative selection for both the turtle and the HIV
data sets. On the other hand, if one uses Tajima's $D$ statistic, one will decide on
negative selection for the turtle data but on neutral selectivity for the HIV data,
with p-values 0.0397 and 0.7298, respectively (Kiihl [13]), even though for both data
sets the observed $D$ statistic is negative.

In order to understand these test results, we should recall that the literature's
accepted interpretation of neutral selectivity is that there is not much change in
the sitewise nucleotide distribution for the mutant sites when compared to the
non-mutant ones. Likewise, negative (positive) selectivity is interpreted as a
significant bias towards a more concentrated nucleotide distribution (towards the
uniform discrete distribution) for the segregating sites when compared to the
non-segregating ones. We will take the observed frequencies as
$\Pi = (\Pi_A, \Pi_C, \Pi_G, \Pi_T)'$. For the turtle data set, the non-segregating
sites have observed frequencies given by
$\Pi_{TUR,NS} = (0.3913, 0.2727, 0.2372, 0.0988)'$ while
$\Pi_{TUR,S} = (0.8571, 0.1429, 0.0000, 0.0000)'$ for the segregating sites. For the
HIV data set, the non-segregating sites have
$\Pi_{HIV,NS} = (0.4550, 0.1327, 0.1706, 0.2417)'$ and we get
$\Pi_{HIV,S} = (0.4091, 0.1364, 0.4545, 0.0000)'$ for the segregating sites. One
should also note that
$(1 - \Pi'_{TUR,S}\Pi_{TUR,S}) - (1 - \Pi'_{TUR,NS}\Pi_{TUR,NS}) = -0.4615$ and
$(1 - \Pi'_{HIV,S}\Pi_{HIV,S}) - (1 - \Pi'_{HIV,NS}\Pi_{HIV,NS}) = -0.0804$, which
clearly point to negative selectivity in both data sets.
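
The two differences quoted above can be reproduced directly from the reported frequency vectors. The short Python check below uses only those published frequencies; the helper name `gini_simpson` is ours.

```python
def gini_simpson(p):
    """1 - p'p: probability that two independent draws from p differ."""
    return 1.0 - sum(q * q for q in p)


# Observed frequencies as reported in the text.
tur_ns = (0.3913, 0.2727, 0.2372, 0.0988)
tur_s = (0.8571, 0.1429, 0.0000, 0.0000)
hiv_ns = (0.4550, 0.1327, 0.1706, 0.2417)
hiv_s = (0.4091, 0.1364, 0.4545, 0.0000)

diff_tur = gini_simpson(tur_s) - gini_simpson(tur_ns)
diff_hiv = gini_simpson(hiv_s) - gini_simpson(hiv_ns)
print(round(diff_tur, 4), round(diff_hiv, 4))
```

Both differences are negative, i.e. the segregating sites are more concentrated than the non-segregating ones, with a much larger gap for the turtle data than for the HIV data.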

To show the behavior of the test statistic under positive selectivity, we have
artificially substituted the nucleotide frequencies for the seven segregating sites
of the turtle data set by nucleotides closer to uniformly distributed ones. We have
for the segregating sites an overall nucleotide frequency of 0.25, varying sitewise
from 11/48 to 15/48 for each nucleotide class. The test value is then 5.7348 with a
p-value of $9.7627 \times 10^{-9}$. We again reject the null hypothesis of selective
neutrality, this time from a positive selectivity point of view.

One can also notice large changes in the observed frequencies between segregating
and non-segregating sites in both examples, which agrees with the proposed test's
conclusions. Tajima's test, however, leads to a different result for the HIV data. We
should also point out that, since the jackknifed estimate of variance is positively
biased (Sen [22] and Arvesen [1]), our results are conservative towards the null
hypothesis, which further strengthens the superior performance of the proposed
test over Tajima's for both data sets.

6. Conclusions 

We presented the inherent flaws associated with Tajima's test for selective
neutrality, among them the vague definition of the null hypothesis (neutral
selectivity), the (possibly) non-normal asymptotic distribution of $D$, the small
number of presumed independent sites and the theoretically large number of sequences.
We proposed a null hypothesis which formalizes the vague notions contained in
Tajima's ideas for neutral selectivity. Moreover, due to the H-decomposition of
$U$-statistics, we are able to provide our test with normal asymptotics, under a
broader setup than those considered in the literature. The attained relaxations
include: mixing positions (if $K$ is large) or any dependence setup (if $K$ is
finite) instead of independently distributed positions; large $n$ and/or $K$ for
normal asymptotics instead of large $K$ for motivation but only large $n$ for
non-normal asymptotics. Resampling schemes, which can be cumbersome, are easily
circumvented by a direct formula for jackknifed $U$-statistics proposed by Sen [22]
and Arvesen [1]. We illustrate the superior performance of the proposed test
statistic over Tajima's on two data sets. In one of the data sets, we get a
different conclusion (which is more reasonable when looking at other descriptive
statistics). For the other data set, we come to the same conclusion, but the p-value
is much smaller, which is again biologically and statistically more reasonable.
Summarizing, the proposed test uses all the advantages of $U$-statistics asymptotics,
can be employed in a more general setup, and its application is quite simple due to
jackknife variance estimation.

Acknowledgments. The authors would like to thank the editors and reviewers 